{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# MXNet for Collaborative Deep Learning in Recommender Systems\n",
    "In this tutorial, we build on MXNet to implement the Collaborative Deep Learning (CDL) [1] model for recommender systems.\n",
    "\n",
    "## Brief Introduction of CDL\n",
    "\n",
    "In CDL, a probabilistic stacked denoising autoencoder (pSDAE) is connected to a matrix factorization (MF) component. Model training will alternate between pSDAE and MF. In each epoch, a pSDAE with a reconstruction target at the end and a regression target in the bottleneck will be udpated before updating the latent factors U and V in the regularized MF.\n",
    "\n",
    "Below is the graphical model for CDL. The part in the red rectangle is pSDAE and the rest is the MF component regularized by pSDAE. Essentially, the updating will alternate between pSDAE (updating $W^+$) and the MF component (updating $u$ and $v$).\n",
    "### Some Notation:\n",
    "- $x_0$: the input vectors to pSDAE (corrupted data, e.g., randomly deleting some entries of the input)\n",
    "- $x_c$: the reconstruction target vectors (the uncorrupted data)\n",
    "- $x_{L/2}$: the output vectors of pSDAE's middle layer (bottleneck layer)\n",
    "- $X_0$: the matrix consists of vectors $x_0$\n",
    "- $X_c$: the matrix consists of vectors $x_c$\n",
    "- $W^+$: weights and biases of pSDAE\n",
    "- $v$: latent item vectors\n",
    "- $u$: latent user vectors\n",
    "- $R$: rating matrix ('1' if the article is in the user's library and '0' otherwise)\n",
    "- $I$: number of users\n",
    "- $J$: number of items\n",
    "- $\\lambda_u$, $\\lambda_v$, $\\lambda_w$, $\\lambda_n$: hyperparameters\n",
    "\n",
    "![](https://raw.githubusercontent.com/dmlc/web-data/master/mxnet/cdl/PGM-CDL.png)\n",
    "\n",
    "Below we show a special case of CDL (from a neural network perspective), where it degenerates to simultaneously training two neural networks overlaid together with a common input layer (the corrupted input) but different output layers. This might be a lot easier to understand for people not familiar with graphical models. \n",
    "\n",
    "![](https://raw.githubusercontent.com/dmlc/web-data/master/mxnet/cdl/NN-CDL.png)\n",
    "\n",
    "The objective function (which we use in this implementation) for this special case is:\n",
    "\\begin{align}\n",
    "\\mathscr{L}=&-\\frac{\\lambda_u}{2}\\sum\\limits_i \\|u_i\\|_2^2\n",
    "-\\frac{\\lambda_w}{2}\\sum\\limits_l(\\|W_l\\|_F^2+\\|b_l\\|_2^2)\\nonumber \\\\\n",
    "&-\\frac{\\lambda_v}{2}\\sum\\limits_j\\|v_j-f_e(X_{0,j*},W^+)^T\\|_2^2 \\nonumber \\\\\n",
    "&-\\frac{\\lambda_n}{2}\\sum\\limits_j\\|f_r(X_{0,j*},W^+)-X_{c,j*}\\|_2^2 \\nonumber \\\\\n",
    "&-\\sum\\limits_{i,j}\\frac{C_{ij}}{2}(R_{ij}-u_i^Tv_j)^2,\n",
    "\\end{align}\n",
    "where the encoder function $f_e(\\cdot,W^+)$ takes the corrupted content vector $X_{0,j*}$ of item $j$ as input and computes the encoding of the item, and the function $f_r(\\cdot,W^+)$ also takes $X_{0,j*}$ as input, computes the encoding and then the reconstructed content vector of item $j$. Here $\\lambda_w$, $\\lambda_n$, $\\lambda_u$, and $\\lambda_v$ are hyperparameters and $C_{ij}$ is a confidence parameter ($C_{ij} = a$ if $R_{ij}=1$ and $C_{ij}=b$ otherwise). For example, if the number of layers $L=6$, $f_e(X_{0,j*},W^+)$ is the output of the third layer while $f_r(X_{0,j*},W^+)$ is the output of the sixth layer.\n",
    "\n",
    "To learn CDL, we have to implement the block coordinate descent (BCD) update using numpy/mshadow and call this BCD procedure after each epoch of pSDAE. Besides the MF part, another difference between CDL and conventional deep learning models is that pSDAE has a fixed target at the end and a dynamic target (the latent item factors V) in the bottleneck layer. It might need some hacking to make this work.\n",
    "\n",
    "[1] H. Wang, N. Wang, and D. Yeung. Collaborative deep learning for recommender systems. In KDD, 2015."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Implementing CDL in MXNet for Recommender Systerms"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "import mxnet as mx\n",
    "import numpy as np\n",
    "import logging\n",
    "import data\n",
    "from math import sqrt\n",
    "from autoencoder import AutoEncoderModel\n",
    "import os"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Setting Hyperparameters\n",
    "- lambda_u: regularization coefficent for user latent matrix U\n",
    "- lambda_v: regularization coefficent for item latent matrix V\n",
    "- K: number of latent factors\n",
    "- is_dummy: whether to use a dummy dataset for demo\n",
    "- num_iter: number of iterations (minibatches) to train (a epoch in the used dataset takes about 68 iterations)\n",
    "- batch_size: minibatch size\n",
    "- dir_save: directory to save training results\n",
    "- lv: lambda_v/lambda_n in CDL; this controls the trade-off between reconstruction error in pSDAE and recommendation accuracy during training"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "lambda_u = 1 # lambda_u in CDL\n",
    "lambda_v = 10 # lambda_v in CDL\n",
    "K = 50\n",
    "p = 1\n",
    "is_dummy = False\n",
    "num_iter = 100 # about 68 iterations/epoch, the recommendation results at the end need 100 epochs\n",
    "batch_size = 256\n",
    "\n",
    "np.random.seed(1234) # set seed\n",
    "lv = 1e-2 # lambda_v/lambda_n in CDL\n",
    "dir_save = 'cdl%d' % p"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Create the directory and the log file."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "p1: lambda_v/lambda_u/ratio/K: 10.000000/1.000000/0.010000/50\n"
     ]
    }
   ],
   "source": [
    "if not os.path.isdir(dir_save):\n",
    "    os.system('mkdir %s' % dir_save)\n",
    "fp = open(dir_save+'/cdl.log','w')\n",
    "print 'p%d: lambda_v/lambda_u/ratio/K: %f/%f/%f/%d' % (p,lambda_v,lambda_u,lv,K)\n",
    "fp.write('p%d: lambda_v/lambda_u/ratio/K: %f/%f/%f/%d\\n' % \\\n",
    "        (p,lambda_v,lambda_u,lv,K))\n",
    "fp.close()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Loading Data\n",
    "Here we load the text information (as input to pSDAE) in the file mult.dat and the rating matrix (as input for the MF part) in the file cf-train-1-users.dat. Code for loading the data are packed in data.py.\n",
    "\n",
    "We use the CiteULike dataset here. The input text is bag-of-words vectors normalized to [0,1]. Some details:\n",
    "- task: recommend articles to users\n",
    "- number of users: 5551\n",
    "- number of items: 16980\n",
    "- number of ratings for training: ~169800\n",
    "- number of terms: 8000"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "# download data\n",
    "import os\n",
    "data_url='https://raw.githubusercontent.com/dmlc/web-data/master/mxnet/cdl'\n",
    "for filename in ('mult.dat', 'cf-train-1-users.dat', 'cf-test-1-users.dat', 'raw-data.csv'):\n",
    "    if not os.path.exists(filename):\n",
    "        os.system(\"wget %s/%s\" % (data_url, filename))\n",
    "# read data\n",
    "X = data.get_mult()\n",
    "R = data.read_user()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Network Definition\n",
    "Here we deine the logging level and construct the network. As mentioned before, pSDAE has multiple targets, we used `mx.symbol.Group` to group both loss. The codes snippets are shown as following, refer to [autoencode.py](autoencoder.py) for more details. \n",
    "\n",
    "```python\n",
    "fe_loss = mx.symbol.LinearRegressionOutput(\n",
    "            data=self.lambda_v_rt*self.encoder,\n",
    "            label=self.lambda_v_rt*self.V)\n",
    "fr_loss = mx.symbol.LinearRegressionOutput(\n",
    "            data=self.decoder, label=self.data)\n",
    "loss = mx.symbol.Group([fe_loss, fr_loss]) \n",
    "```"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "logging.basicConfig(level=logging.INFO)\n",
    "cdl_model = AutoEncoderModel(mx.cpu(2), [X.shape[1],100,K],\n",
    "    pt_dropout=0.2, internal_act='relu', output_act='relu')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Initializing Variables\n",
    "Here we initialize several variables. V is the latent item matrix and lambda_v_rt is an ndarray with entries equal to sqrt(lv). We need this lambda_v_rt to hack the trade-off between two targets in pSDAE."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "train_X = X\n",
    "V = np.random.rand(train_X.shape[0],K)/10\n",
    "lambda_v_rt = np.ones((train_X.shape[0],K))*sqrt(lv)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Training the CDL\n",
    "Train the whole CDL (joint training of pSDAE and the connected MF). We use SGD for pSDAE and BCD for the MF part. U is the user latent matrix, V is the item latent matrix, theta is the output of pSDAE's middle layer, and BCD_loss equals to rating_loss+reg_loss_for_U+reg_loss_for_V. For demostration we train for only 100 iterations (about 1.5 epochs) here. The shown recommendations in later parts are results after 100 epochs.\n",
    "\n",
    "The function finetune below will call the function 'solve' in the solver.py, where the customized training loop resides. In the training loop, we call the following code after each epoch of pSDAE to update U and V using BCD. The BCD updating procedure is wrapped up in the function BCD_one. Note that after each epoch, we upate U and V for only one iteration.\n",
    "\n",
    "```python\n",
    "theta = model.extract_feature(sym[0], args, auxs,\n",
    "    data_iter, X.shape[0], xpu).values()[0]\n",
    "# update U, V and get BCD loss\n",
    "U, V, BCD_loss = BCD_one(R, U, V, theta,\n",
    "    lambda_u, lambda_v, dir_save, True)\n",
    "# get recon' loss\n",
    "Y = model.extract_feature(sym[1], args, auxs,\n",
    "    data_iter, X.shape[0], xpu).values()[0]\n",
    "Recon_loss = lambda_v/np.square(lambda_v_rt_old[0,0])*np.sum(np.square(Y-X))/2.0\n",
    "lambda_v_rt[:] = lambda_v_rt_old[:] # back to normal lambda_v_rt\n",
    "data_iter = mx.io.NDArrayIter({'data': X, 'V': V, 'lambda_v_rt':\n",
    "    lambda_v_rt},\n",
    "    batch_size=batch_size, shuffle=False,\n",
    "    last_batch_handle='pad')\n",
    "data_iter.reset()\n",
    "batch = data_iter.next()\n",
    "```"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "INFO:root:Fine tuning...\n",
      "INFO:root:Iter:0 metric:0.001668\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Epoch 1 - tr_err/bcd_err/rec_err: 53641376.1/27755.0/53613621.1\n"
     ]
    }
   ],
   "source": [
    "U, V, theta, BCD_loss = cdl_model.finetune(train_X, R, V, lambda_v_rt, lambda_u,\n",
    "        lambda_v, dir_save, batch_size,\n",
    "        num_iter, 'sgd', l_rate=0.1, decay=0.0,\n",
    "        lr_scheduler=mx.misc.FactorScheduler(20000,0.1))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Saving Models and Parameters\n",
    "Save the network (pSDAE) parameters, latent matrices, and middle-layer output."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "cdl_model.save(dir_save+'/cdl_pt.arg')\n",
    "np.savetxt(dir_save+'/final-U.dat.demo',U,fmt='%.5f',comments='')\n",
    "np.savetxt(dir_save+'/final-V.dat.demo',V,fmt='%.5f',comments='')\n",
    "np.savetxt(dir_save+'/final-theta.dat.demo',theta,fmt='%.5f',comments='')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Computing Training Error\n",
    "The training loss consists of the loss in pSDAE and that in MF."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Training error: 53629559.864\n"
     ]
    }
   ],
   "source": [
    "Recon_loss = lambda_v/lv*cdl_model.eval(train_X,V,lambda_v_rt)\n",
    "print \"Training error: %.3f\" % (BCD_loss+Recon_loss)\n",
    "fp = open(dir_save+'/cdl.log','a')\n",
    "fp.write(\"Training error: %.3f\\n\" % (BCD_loss+Recon_loss))\n",
    "fp.close()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Generating Recommendations\n",
    "Load the latent matrices (U and V), compute the predicted ratings R=UV^T, and generate recommendation lists for each user. There 5551 users in the dataset."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {
    "collapsed": false,
    "scrolled": true
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "User 100\n",
      "User 200\n",
      "User 300\n",
      "User 400\n",
      "User 500\n",
      "User 600\n",
      "User 700\n",
      "User 800\n",
      "User 900\n",
      "User 1000\n",
      "User 1100\n",
      "User 1200\n",
      "User 1300\n",
      "User 1400\n",
      "User 1500\n",
      "User 1600\n",
      "User 1700\n",
      "User 1800\n",
      "User 1900\n",
      "User 2000\n",
      "User 2100\n",
      "User 2200\n",
      "User 2300\n",
      "User 2400\n",
      "User 2500\n",
      "User 2600\n",
      "User 2700\n",
      "User 2800\n",
      "User 2900\n",
      "User 3000\n",
      "User 3100\n",
      "User 3200\n",
      "User 3300\n",
      "User 3400\n",
      "User 3500\n",
      "User 3600\n",
      "User 3700\n",
      "User 3800\n",
      "User 3900\n",
      "User 4000\n",
      "User 4100\n",
      "User 4200\n",
      "User 4300\n",
      "User 4400\n",
      "User 4500\n",
      "User 4600\n",
      "User 4700\n",
      "User 4800\n",
      "User 4900\n",
      "User 5000\n",
      "User 5100\n",
      "User 5200\n",
      "User 5300\n",
      "User 5400\n",
      "User 5500\n"
     ]
    }
   ],
   "source": [
    "import numpy as np\n",
    "from data import read_user\n",
    "def cal_rec(p,cut):\n",
    "    R_true = read_user('cf-test-1-users.dat')\n",
    "    dir_save = 'cdl'+str(p)\n",
    "    #U = np.mat(np.loadtxt(dir_save+'/final-U.dat'))\n",
    "    #V = np.mat(np.loadtxt(dir_save+'/final-V.dat'))\n",
    "    R = U*V.T\n",
    "    num_u = R.shape[0]\n",
    "    num_hit = 0\n",
    "    fp = open(dir_save+'/rec-list.dat','w')\n",
    "    for i in range(num_u):\n",
    "        if i!=0 and i%100==0:\n",
    "            print 'User '+str(i)\n",
    "        l_score = R[i,:].A1.tolist()\n",
    "        pl = sorted(enumerate(l_score),key=lambda d:d[1],reverse=True)\n",
    "        l_rec = list(zip(*pl)[0])[:cut]\n",
    "        s_rec = set(l_rec)\n",
    "        s_true = set(np.where(R_true[i,:]>0)[1].A1)\n",
    "        cnt_hit = len(s_rec.intersection(s_true))\n",
    "        fp.write('%d:' % cnt_hit)\n",
    "        fp.write(' '.join(map(str,l_rec)))\n",
    "        fp.write('\\n')\n",
    "    fp.close()\n",
    "\n",
    "cal_rec(1,8)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Show Recommendations\n",
    "Load the article titles (raw-data.csv), ratings (cf-train-1-users.dat and cf-test-1-users.dat), and recommendation lists (rec-list.dat)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "import csv\n",
    "from data import read_user\n",
    "import numpy as np\n",
    "p = 1\n",
    "# read predicted results\n",
    "dir_save = 'cdl%d' % p\n",
    "csvReader = csv.reader(open('raw-data.csv','rb'))\n",
    "d_id_title = dict()\n",
    "for i,row in enumerate(csvReader):\n",
    "    if i==0:\n",
    "        continue\n",
    "    d_id_title[i-1] = row[3]\n",
    "R_test = read_user('cf-test-1-users.dat')\n",
    "R_train = read_user('cf-train-1-users.dat')\n",
    "fp = open(dir_save+'/rec-list.dat')\n",
    "lines = fp.readlines()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Show the titles of articles in the training set and titles of recommended articles. Correctly recommended articles are marked by asterisks."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {
    "collapsed": false,
    "scrolled": true
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "##########  User 3  ##########\n",
      "\n",
      "#####  Articles in the Training Sets  #####\n",
      "Formal Ontology and Information Systems\n",
      "Ontologies: a silver bullet for knowledge management and electronic commerce\n",
      "Business Process Execution Language for Web Services version 1.1\n",
      "Unraveling the Web services web: an introduction to SOAP, WSDL, and UDDI\n",
      "A cookbook for using the model-view controller user interface paradigm in Smalltalk-80\n",
      "Object-oriented application frameworks\n",
      "Data integration: a theoretical perspective\n",
      "Web services: been there, done that?\n",
      "Sweetening Ontologies with DOLCE\n",
      "Naive Geography\n",
      "\n",
      "#####  Articles Recommended (Correct Ones Marked by Asterisks)  #####\n",
      "The Wisdom of Crowds\n",
      "Nexus: Small Worlds and the Groundbreaking Theory of Networks\n",
      "VisANT: an integrative framework for networks in systems biology\n",
      "The Essays of Warren Buffett : Lessons for Corporate America\n",
      "What's new in pediatric orthopaedics.\n",
      "Doing with Understanding: Lessons from Research on Problem- and Project-Based Learning\n",
      "Predictably Irrational: The Hidden Forces That Shape Our Decisions\n",
      "The Predictably Irrational CD: The Hidden Forces That Shape Our Decisions\n"
     ]
    }
   ],
   "source": [
    "user_id = 3\n",
    "s_test = set(np.where(R_test[user_id,:]>0)[1].A1)\n",
    "l_train = np.where(R_train[user_id,:]>0)[1].A1.tolist()\n",
    "l_pred = map(int,lines[user_id].strip().split(':')[1].split(' '))\n",
    "print '##########  User '+str(user_id)+'  ##########\\n'\n",
    "print '#####  Articles in the Training Sets  #####'\n",
    "for i in l_train:\n",
    "    print d_id_title[i]\n",
    "print '\\n#####  Articles Recommended (Correct Ones Marked by Asterisks)  #####'\n",
    "for i in l_pred:\n",
    "    if i in s_test:\n",
    "        print '* '+d_id_title[i]\n",
    "    else:\n",
    "        print d_id_title[i]\n",
    "fp.close()"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 2",
   "language": "python",
   "name": "python2"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 2
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython2",
   "version": "2.7.6"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 1
}
